Combining Lexical and Formatting Cues for Named Entity Acquisition from the Web
Abstract
Because named entities (NEs) are constantly renewed, fresh ones must be acquired from recent text sources. We present a tool for acquiring and typing NEs from the Web that combines a harvester with three parallel shallow parsers, each dedicated to a specific structure (lists, enumerations, and anchors). The parsers combine lexical cues such as discourse markers with formatting instructions (HTML tags) to analyze enumerations and their associated initializers.
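To illustrate the idea of pairing a lexical cue with a formatting cue, here is a minimal sketch in Python. It is not the authors' tool: the marker list `MARKERS`, the class `ListHarvester`, and the function `candidate_entities` are hypothetical names chosen for this example, and the sketch assumes flat (non-nested) HTML lists. It keeps a list only when the text preceding it (the initializer) contains a discourse marker, so the initializer can later serve as a hint for typing the items.

```python
from html.parser import HTMLParser

# Hypothetical discourse markers that may introduce an enumeration.
MARKERS = ("such as", "including", "following")


class ListHarvester(HTMLParser):
    """Collect <li> items together with the text that precedes their list."""

    def __init__(self):
        super().__init__()
        self.preceding = []   # text chunks seen outside lists since the last list opened
        self.depth = 0        # > 0 while inside a <ul> or <ol>
        self.in_item = False  # True while inside an <li>
        self.item = []        # text chunks of the current <li>
        self.results = []     # (initializer, [items]) pairs

    def handle_starttag(self, tag, attrs):
        if tag in ("ul", "ol"):
            if self.depth == 0:
                # Formatting cue: a list opens; the text before it is the initializer.
                initializer = " ".join("".join(self.preceding).split())
                self.preceding = []
                self.results.append((initializer, []))
            self.depth += 1
        elif tag == "li" and self.depth > 0:
            self.in_item = True
            self.item = []

    def handle_endtag(self, tag):
        if tag == "li" and self.in_item:
            text = " ".join("".join(self.item).split())
            if text:
                self.results[-1][1].append(text)
            self.in_item = False
        elif tag in ("ul", "ol") and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.in_item:
            self.item.append(data)
        elif self.depth == 0:
            self.preceding.append(data)


def candidate_entities(html):
    """Return (initializer, items) pairs whose initializer holds a marker."""
    parser = ListHarvester()
    parser.feed(html)
    parser.close()
    # Lexical cue: keep only lists introduced by a discourse marker.
    return [
        (init, items)
        for init, items in parser.results
        if any(marker in init.lower() for marker in MARKERS)
    ]
```

In this sketch, a navigation menu introduced by a bare word such as "Menu" is discarded, while a list introduced by "Famous composers such as:" is kept along with its initializer.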
Similar resources
A High-Accurate Chinese-English NE Backward Translation System Combining Both Lexical Information and Web Statistics
Named entity translation is indispensable in cross-language information retrieval. We propose an approach that combines lexical information, web statistics, and inverse search based on Google to backward translate a Chinese named entity (NE) into English. Our system achieves a high Top-1 accuracy of 87.6%, which is relatively good performance among the results reported in this area to date.
Combining n-gram based statistics with traditional methods for named entity recognition
In this paper, we show three main results. First, we show that an n-gram dataset built from a large web crawl, as opposed to data from the specific target domain, can be used to perform the task of named entity recognition with reasonable accuracy. Second, we show that for complex domains, such as the MUC-7 NER task, the Lex method may not perform as well as other methods, due largely in part t...
Multiobjective Optimization and Unsupervised Lexical Acquisition for Named Entity Recognition and Classification
In this paper, we investigate the utility of unsupervised lexical acquisition techniques for improving the quality of Named Entity Recognition and Classification (NERC) in resource-poor languages. As it is not a priori clear which unsupervised lexical acquisition techniques are useful for a particular task or language, careful feature selection is necessary. We treat feature selection as a mu...
PAYMA: A Tagged Corpus of Persian Named Entities
The goal in the named entity recognition task is to classify proper nouns of a piece of text into classes such as person, location, and organization. Named entity recognition is an important preprocessing step in many natural language processing tasks such as question-answering and summarization. Although many research studies have been conducted in this area in English and the state-of-the-art...
Production of English Lexical Stress by Persian EFL Learners
This study examines the phonetic properties of lexical stress in English produced by Persian speakers learning English as a foreign language. The four most reliable phonetic correlates of English lexical stress, namely fundamental frequency, duration, intensity, and vowel quality were measured across Persian speakers’ production of the stressed and unstressed syllables of five English disyllabi...